Prework: Load Libraries, Define Constants and Functions

Graphing Functions

Utility Functions

1 Load Data and Initial Inspection

2-A Univariate Analysis

Let's examine the numeric, non-categorical columns first

2-A (1) Customer_Age

Customer Age (dtype: int)

2-A (2) Total_Trans_Ct

Total_Trans_Ct (Total Transaction Count - last 12 m) (dtype: int)

2-A (3) Total_Trans_Amt

Total_Trans_Amt (Total Transaction Amount - last 12 m) (dtype: int)

2-A (4) Total_Revolving_Bal

Total_Revolving_Bal (Total Revolving Balance) (dtype: int)

2-A (5) Months_on_book

Months_on_book (dtype: int)

2-A (6) Total_Amt_Chng_Q4_Q1

Total_Amt_Chng_Q4_Q1 (dtype: float)

2-A (7) Avg_Open_To_Buy

Avg_Open_To_Buy (dtype: float)

2-A (8) Credit_Limit

Credit_Limit (dtype: float)

2-A (9) Avg_Utilization_Ratio

Avg_Utilization_Ratio (dtype: float)

2-A (10) Total_Ct_Chng_Q4_Q1

Total_Ct_Chng_Q4_Q1 (dtype: float)

2-B Univariate Analysis

Now we examine the categorical columns, both numeric and non-numeric

2-B (11) Total_Relationship_Count

Total_Relationship_Count (dtype: int)

2-B (12) Dependent_count

Dependent_count (dtype: int)

2-B (13) Months_Inactive_12_mon

Months_Inactive_12_mon (dtype: int)

2-B (14) Contacts_Count_12_mon

Contacts_Count_12_mon (dtype: int)

2-B (15) Card_Category

Card_Category (dtype: object)

2-B (16) Income_Category

Income_Category (dtype: object)

2-B (17) Marital_Status

Marital_Status (dtype: object)

2-B (18) Education_Level

Education_Level (dtype: object)

2-B (19) Gender

Gender (dtype: object)

2-B (20) Attrition_Flag (Target Variable)

Attrition_Flag (dtype: object)

2-C Encoding Categorical Columns
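A minimal sketch of one way to encode the categorical columns, assuming an explicit label-to-integer mapping per column so that missing values stay as NaN for later imputation. The toy frame, the category labels, and the `encode_column` helper are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative).
df = pd.DataFrame({
    "Card_Category": ["Blue", "Silver", "Gold", "Blue", None],
    "Gender": ["F", "M", "F", None, "M"],
})

# Hypothetical mappings; the real category labels may differ.
mappings = {
    "Card_Category": {"Blue": 0, "Silver": 1, "Gold": 2, "Platinum": 3},
    "Gender": {"F": 0, "M": 1},
}

def encode_column(series, mapping):
    """Map labels to numeric codes; missing or unmapped labels become NaN."""
    return series.map(mapping).astype("float64")

encoded = df.copy()
for col, mapping in mappings.items():
    encoded[col] = encode_column(df[col], mapping)

print(encoded)
```

Keeping the mapping dictionaries around makes the later reverse-encoding step a matter of inverting each dictionary.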

2-D Bivariate Analysis

Further bivariate investigation for non categorical columns

Numeric Variables with Target

Further bivariate investigation for categorical columns

Categorical Variables with Target

  1. Total_Relationship_Count
    • The more relationships a customer has with the bank, the less likely the customer is to attrite
    • This makes sense: such a customer is more invested in the bank, and cross-selling is a long-accepted best practice in the banking industry
  2. Dependent_count
    • Attrition likelihood generally increases slightly with the number of dependents, except at 5 dependents, where it drops
  3. Months_Inactive_12_mon
    • We can ignore the 0 value, as only 0.3% of rows have 0 Months_Inactive_12_mon
    • For the other values, attrition increases as months of inactivity increase (5 and 6 months show a slight increase, but again only ~1.5% of rows each have those values, which is not enough to draw trends)
  4. Contacts_Count_12_mon
    • As contacts increase, attrition increases up to 3 contacts, but then there is a sharp drop from 3 to 4 and from 4 to 5
    • At 6 contacts attrition is 100%, but only 0.5% of rows have 6 contacts, which is too few to infer a trend
  5. Card_Category
    • Holders of higher-ranked cards are more likely to attrite than Blue card holders. However, there are very few customers with Gold, Silver, or Platinum cards
  6. Income_Category
    • No clear trends here
  7. Marital_Status
    • No significant trends vs. target
  8. Gender
    • Slightly higher chance of attriting for females (17.4% of females attrited vs. 14.6% of males)
  9. Education_Level
    • Possibly a higher probability of attriting at higher education levels
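
The per-level attrition rates and row shares quoted in the list above can be computed along the following lines. This is a sketch on toy data: the `attrition_summary` helper and the "Attrited Customer" / "Existing Customer" labels are assumptions about how the target is coded, and the numbers it prints are not the notebook's results.

```python
import pandas as pd

# Toy data; the real notebook would use the full dataset.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Attrition_Flag": [
        "Attrited Customer", "Existing Customer",
        "Existing Customer", "Existing Customer",
        "Existing Customer", "Existing Customer",
        "Existing Customer", "Existing Customer",
    ],
})

def attrition_summary(frame, col, target="Attrition_Flag",
                      positive="Attrited Customer"):
    """Attrition rate, row count and row share for each level of `col`."""
    attrited = frame[target].eq(positive)  # boolean Series, True = attrited
    summary = attrited.groupby(frame[col]).agg(
        attrition_rate="mean", n_rows="count"
    )
    summary["row_share"] = summary["n_rows"] / len(frame)
    return summary

summary = attrition_summary(df, "Gender")
print(summary)
```

Reporting the row share next to the rate makes it easy to spot levels (such as 0 months inactive or 6 contacts) that are too sparse to support a trend.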

3A Dropping Columns, Splitting Train/Test Sets
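A minimal sketch of this step, assuming an identifier column is dropped and the split is stratified on the target so both sets keep the same attrition rate. The `CLIENTNUM` column name, the 25% test size, and the binary target coding are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 4 attrited (1) and 16 existing (0) customers.
df = pd.DataFrame({
    "CLIENTNUM": range(20),  # hypothetical ID column with no predictive value
    "Customer_Age": [30 + i for i in range(20)],
    "Attrition_Flag": [1] * 4 + [0] * 16,
})

# Drop the identifier and separate features from the target.
X = df.drop(columns=["CLIENTNUM", "Attrition_Flag"])
y = df["Attrition_Flag"]

# stratify=y keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

print(y_train.mean(), y_test.mean())
```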

3B Imputing Missing Values using KNN
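A minimal sketch of KNN imputation with scikit-learn's `KNNImputer`, assuming the categorical columns have already been encoded as numbers with NaN marking missing values. The toy values and `n_neighbors=1` are illustrative choices, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy encoded data; Education_Level holds ordinal codes, NaN = missing.
X_train = pd.DataFrame({
    "Customer_Age": [30.0, 32.0, 50.0, 52.0],
    "Education_Level": [1.0, np.nan, 3.0, 3.0],
})
X_test = pd.DataFrame({
    "Customer_Age": [51.0],
    "Education_Level": [np.nan],
})

# Fit on the training set only, then reuse the fitted imputer on the test set.
imputer = KNNImputer(n_neighbors=1)
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_imp)
print(X_test_imp)
```

Fitting on the training set alone avoids leaking test information into the imputed values. With more than one neighbour the imputed codes are averages, so they may need rounding before they are mapped back to category labels.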

3C Reverse Encoding Categorical Columns

4A Logistic Regression

Logistic Regression - Oversampling

We try regularisation to fix overfitting
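A minimal sketch of the idea, assuming L2-regularised logistic regression where a smaller `C` means a stronger penalty. The synthetic data, the `C` values, and the use of recall as the score are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real training set.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.84, 0.16], random_state=1,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=1
)

results = {}
for C in (100.0, 1.0, 0.01):  # smaller C = stronger L2 penalty
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    results[C] = {
        "train_recall": recall_score(y_train, model.predict(X_train)),
        "val_recall": recall_score(y_val, model.predict(X_val)),
        "coef_norm": float(np.linalg.norm(model.coef_)),
    }

for C, r in results.items():
    print(C, r)
```

A shrinking gap between the training and validation scores as `C` decreases is the sign that regularisation is reducing overfitting; the coefficient norm shows how strongly the weights are being shrunk.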

Logistic Regression - Undersampling
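A minimal sketch of random undersampling in plain pandas, assuming the majority class is downsampled to the size of the minority class on the training set only. The notebook may use a dedicated resampling library instead; the `undersample` helper and toy data are illustrative.

```python
import pandas as pd

# Toy training set: 16 attrited (1) vs. 84 existing (0) customers.
train = pd.DataFrame({
    "Total_Trans_Ct": range(100),
    "Attrition_Flag": [1] * 16 + [0] * 84,
})

def undersample(frame, target, random_state=1):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = int(frame[target].value_counts().min())
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Concatenate the per-class samples and shuffle the row order.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

balanced = undersample(train, "Attrition_Flag")
print(balanced["Attrition_Flag"].value_counts())
```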

4B Ensemble Methods

Initial Inspection of models

Running kfolds on untuned models
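A minimal sketch of scoring untuned models with stratified k-fold cross-validation. The synthetic data, the choice of five folds, the recall metric, and the subset of models shown are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real training set.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=1
)

# Default (untuned) hyperparameters; only the seed is fixed.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = {
    name: cross_val_score(model, X, y, scoring="recall", cv=cv)
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Using the same `cv` object for every model means all models are compared on identical folds.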

Grid and Random Search CV
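A minimal sketch contrasting the two search strategies on a decision tree. The parameter grid, the recall metric, and `n_iter=8` are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real training set.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=1
)

# Hypothetical search space: 4 * 3 * 2 = 24 combinations.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
    "class_weight": [None, "balanced"],
}

# Grid search evaluates every combination.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring="recall", cv=5
)
grid.fit(X, y)

# Randomized search evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid,
    n_iter=8, scoring="recall", cv=5, random_state=1,
)
rand.fit(X, y)

print("grid  :", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search is exhaustive but its cost grows with the product of the option counts; randomized search trades completeness for a fixed budget of fits, which matters more for the slower ensemble models.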

Logistic Regression - Grid and RandomSearch CV

Now for Oversampling LR

Undersampling

Decision Tree - Grid and RandomSearch CV

Random Forest - Grid and RandomSearch CV

Bagging - Grid and RandomSearch CV

Adaboost - Grid and RandomSearch CV

Gradient Boosting - Grid and RandomSearch CV

XGBoost - Grid and RandomSearch CV

4C Comparing All Models

Comparing Scores for best models found through grid cv

Comparing Scores for best models found through random search cv

Visualisation of scores across models as below

Feature Importances
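A minimal sketch of extracting and ranking feature importances from a fitted tree ensemble. The data here is synthetic: the column names are borrowed from the dataset purely as labels, and the random forest is an assumed stand-in for whichever model the notebook selects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names are used only as labels; the values are synthetic.
feature_names = [
    "Total_Trans_Ct", "Total_Trans_Amt",
    "Total_Revolving_Bal", "Customer_Age",
]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    random_state=1,
)
X = pd.DataFrame(X, columns=feature_names)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favour high-cardinality numeric features, so they are best read as a ranking rather than as exact effect sizes.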

5 Business Implications From This Exercise

Scoring

Modelling

Feature Importance